Austin Animal Shelter Analysis
Posted on Wed 21 December 2016 in Projects
Austin Animal Shelter¶
Animal welfare is close to my heart, so I am excited to analyze this dataset. The data was posted on Kaggle, but I did not discover it until after the competition had ended. The goal set by Austin Animal Shelter for this data was to predict the outcome of each animal.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
import category_encoders as ce
import xgboost as xgb
import seaborn as sns
%matplotlib inline
Load Data¶
df = pd.read_csv('train.csv')
orig_df = pd.read_csv('train.csv')
df.shape
df.head()
df.columns
Convert age column to age in years¶
The age column contains strings and the unit of age is not consistent. So, we will convert the age column into age in years. There are some entries with no age value. We will first fill in those null values with -99, and then replace them with the average age.
df['AgeuponOutcome'].unique()
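Ages = df['AgeuponOutcome'].astype(str)
# Convert each age string to years; entries without a recognized unit become -99
y = np.array([float(age.split()[0]) if 'year' in age
    else float(age.split()[0])/52. if 'week' in age
    else float(age.split()[0])/12. if 'month' in age
    else float(age.split()[0])/365. if 'day' in age
    else -99.0
    for age in Ages])
# y has to be a NumPy array for the boolean masks below to work
y[y == -99] = np.mean(y[y != -99])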
Here, we replace the missing ages with the average age of the animals¶
df['AgeuponOutcome'] = y
df.head()
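The unit conversion described above can also be written as a small standalone function, which makes it easier to check on a few hand-picked strings. This is only a sketch: the helper name `age_in_years`, the sample strings, and the choice to return `None` for missing entries are assumptions for illustration, not part of the original notebook.

```python
def age_in_years(age):
    """Convert strings such as '2 years' or '3 weeks' to a float number of years."""
    # Years per unit; plural forms are handled by stripping a trailing 's'.
    units = {'year': 1.0, 'month': 1.0 / 12, 'week': 1.0 / 52, 'day': 1.0 / 365}
    parts = str(age).split()
    if len(parts) != 2:
        return None  # missing or malformed entry, e.g. 'nan'
    number, unit = parts
    unit = unit.rstrip('s')
    if unit not in units:
        return None
    return float(number) * units[unit]

# Hypothetical sample values, including one missing entry.
raw = ['1 year', '2 years', '3 weeks', '6 months', '5 days', 'nan']
ages = [age_in_years(a) for a in raw]

# Fill missing ages with the mean of the known ages.
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)
filled = [mean_age if a is None else a for a in ages]
```

Keeping missing values as `None` until the mean is computed avoids a sentinel such as -99 accidentally leaking into the average.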
Basic Stats for Dogs vs Cats¶
Average age upon Outcome¶
df[df['AnimalType'] == 'Dog']['AgeuponOutcome'].mean()
df[df['AnimalType'] == 'Cat']['AgeuponOutcome'].mean()
Number of dogs vs cats¶
len(df[df['AnimalType'] == 'Dog'])
len(df[df['AnimalType'] == 'Cat'])
Summary of basic stats: At the time of outcome, dogs are older on average (2.75 years) than cats (1.36 years). The shelter also housed more dogs than cats over the course of 3 years.
Create columns for male vs female, fixed vs intact¶
df['SexuponOutcome'].unique()
df['SexuponOutcome'] = df['SexuponOutcome'].replace(np.nan,'Unknown')
sex = ['male' if 'Male' in df['SexuponOutcome'].iloc[i]
else 'female' if 'Female' in df['SexuponOutcome'].iloc[i]
else 'unknown'
for i in range(len(df['SexuponOutcome']))]
spayed_neutered = ['fixed' if any(x in df['SexuponOutcome'].iloc[i] for x in ['Neutered','Spayed'])
else 'not_fixed' if 'Intact' in df['SexuponOutcome'].iloc[i]
else 'unknown_fixed'
for i in range(len(df['SexuponOutcome']))]
set(spayed_neutered)
set(sex)
df['Fixed'] = spayed_neutered
df['Sex'] = sex
df = df.drop('SexuponOutcome',axis = 1)
Simplify the Breed Feature¶
df.head()
len(df['Breed'].unique())
Currently, there are 1380 different breed classifications among cats and dogs¶
df['Breed'].value_counts()[0:10]
df[df['AnimalType'] == 'Cat']['Breed'].value_counts()[0:10]
len(df[df['AnimalType'] == 'Cat']['Breed'].unique())
len(df[df['AnimalType'] == 'Dog']['Breed'].unique())
Domestic Shorthair Mix seems to be a classification for cats only¶
- Cat breeds seem to be somewhat limited with 60 different breeds while dogs have 1320 different classifications. We will only simplify dog breeds.
df[df['AnimalType'] == 'Dog']['Breed'].value_counts()[0:20]
Let's remove the "Mix" from the dog breeds and also take only the first breed from classifications in the form "breed 1 / breed 2", assuming that the first breed listed is the dominant breed¶
import re
breeds = list(df[df['AnimalType'] == 'Dog']['Breed'])
dog_breeds = [re.sub(' Mix', '', dog) if 'Mix' in dog else dog.split("/")[0] for dog in breeds]
# for i in range(len(dog_breeds)):
# if ('Mix' in dog_breeds[i]):
# dog_breeds[i] = re.sub(' Mix', '', dog_breeds[i])
# else:
# dog_breeds[i] = dog_breeds[i].split("/")[0]
len(set(dog_breeds))
dog_breeds_df = pd.DataFrame({'Breed':dog_breeds})
We were able to narrow down the different classes of dog breeds from 1320 to 188¶
dog_breeds_df['Breed'].value_counts()[0:10]
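For comparison, here is a sketch of a helper that applies both steps (dropping "Mix" and keeping the first listed breed) to every string. The name `simplify_breed` and the example strings are assumptions. The list comprehension used earlier applies only one of the two steps per string, so running this variant on the real data may give a count of distinct breeds different from the 188 reported.

```python
def simplify_breed(breed):
    """Drop a trailing ' Mix' and keep only the first breed of 'A/B'."""
    suffix = ' Mix'
    if breed.endswith(suffix):
        breed = breed[:-len(suffix)]
    return breed.split('/')[0]

# Hypothetical breed strings in the style of the dataset.
examples = ['Pit Bull Mix', 'Labrador Retriever/Pit Bull', 'Chihuahua Shorthair']
simplified = [simplify_breed(b) for b in examples]
```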
Now let's replace all the dog breed classifications from the original dataframe with the new breed classifications¶
df.loc[df.AnimalType == 'Dog', 'Breed'] = dog_breeds
len(df['Breed'].unique())
x = df['Breed'].value_counts().index.tolist()
y = df['Breed'].value_counts().tolist()
common_breed = df['Breed'].value_counts().index.tolist()[0:10]
pd.crosstab(df[df['Breed'].isin(common_breed)]['Breed'],df['OutcomeType'])
Simplify Dog Color Column¶
The color column has 366 different colors. We will simplify this column similar to the breed column, taking the first color as the dominant color.¶
len(df['Color'].unique())
len(df[df['AnimalType'] == 'Dog']['Color'].unique())
df[df['AnimalType'] == 'Dog']['Color'].value_counts()[0:15]
df[df['AnimalType'] == 'Cat']['Color'].value_counts()[0:15]
# Take the first color
colors = df['Color']
color = [c.split("/")[0] for c in df['Color']]
len(pd.DataFrame({'Color':color})['Color'].unique())
pd.DataFrame({'Color':color})['Color'].value_counts()[0:15]
Taking the first listed color as the dominant color, we reduced the number of possible color values from 366 to 57.
df['Color'] = color
df.head()
Simplify the date and time column¶
df['DateTime'].head()
from datetime import datetime
d = df['DateTime'][0].split()[0]
dates = [df['DateTime'][i].split()[0] for i in range(len(df['DateTime']))]
date = [datetime.strptime(d,'%Y-%m-%d') for d in dates]
# season = ['Winter' if d.month % 11 <= 2
# else 'Spring' if (d.month % 11 >= 3) & (d.month % 11 <= 5)
# else 'Summer' if (d.month % 11 >= 6) & (d.month % 11 <= 8)
# else 'Fall'
# for d in date]
week = [1 if d.day <= 7
else 2 if (d.day > 7) & (d.day <= 14)
else 3 if (d.day > 14) & (d.day <= 21)
else 4
for d in date]
# df['Season'] = season
df['Week'] = week
df.head()
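The week-of-month bucketing above can be expressed with integer arithmetic instead of chained conditions. The sketch below assumes the same rule (days 1-7 map to week 1, 8-14 to week 2, 15-21 to week 3, and everything later to week 4). The function name `week_of_month` and the sample timestamp are hypothetical.

```python
from datetime import datetime

def week_of_month(day):
    """Map a day of the month (1-31) to a week bucket 1-4; days 22-31 share bucket 4."""
    if not 1 <= day <= 31:
        raise ValueError('day must be between 1 and 31')
    return min((day - 1) // 7 + 1, 4)

# Hypothetical timestamp in the 'YYYY-MM-DD HH:MM:SS' layout parsed above.
stamp = datetime.strptime('2014-02-12 18:22:00'.split()[0], '%Y-%m-%d')
bucket = week_of_month(stamp.day)
```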
Create Indicator Column whether the animal was given a name or not¶
df['Name'] = df['Name'].isnull() * 1.
df = df.rename(columns = {'Name':'NoName'})
Create column for indicator of Animal Type¶
df['AnimalType'].unique()
df['AnimalType'] = (df['AnimalType'] == 'Dog') * 1.
Create Dummy Variables for Columns with Categorical Variables¶
categorical_var = ['Color', 'Breed','Fixed', 'Sex']
binary = ce.binary.BinaryEncoder(cols = categorical_var)
binary.fit(df)
dat = binary.transform(df)
# dat = pd.concat([df.drop(categorical_var, axis = 1), pd.get_dummies(df[categorical_var])], axis = 1)
pd.set_option('display.max_columns', 100)
dat.head()
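To see why binary encoding keeps the number of columns small, the sketch below encodes a toy list of categories by hand: each category gets an ordinal code, and that code is written out in base 2, one bit per column. This is an illustration of the idea only. The function `binary_encode` is hypothetical, and the exact codes, column order, and column count produced by category_encoders may differ.

```python
def binary_encode(values):
    """Return (width, rows): each category's ordinal code written in base 2."""
    # Assign ordinal codes starting at 1, in order of first appearance.
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1
    # Number of bits needed to represent the largest code.
    width = max(codes.values()).bit_length()
    rows = [[(codes[v] >> bit) & 1 for bit in reversed(range(width))]
            for v in values]
    return width, rows

# Hypothetical color values.
colors = ['Black', 'White', 'Brown', 'Black', 'Tan', 'Blue']
width, rows = binary_encode(colors)
```

Under this scheme, 57 distinct colors would fit in 6 bit columns, whereas one-hot dummy variables would need 57.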
Check dimensions of data frame to check dummy function¶
Create Labels¶
lab = dat['OutcomeType']
lab.value_counts()
le = LabelEncoder()
labels = le.fit_transform(dat['OutcomeType'])
Drop columns not needed for models¶
dat.drop(['AnimalID','DateTime','OutcomeType','OutcomeSubtype'],axis = 1).dtypes.unique()
dat = dat.drop(['AnimalID','DateTime','OutcomeType','OutcomeSubtype', 'AnimalType'],axis = 1)
Create Training and Testing Sets¶
X_train, X_test, Y_train, Y_test = train_test_split(dat, labels, test_size = 0.25)
df.OutcomeType.value_counts()/float(df.shape[0])
np.bincount(Y_train)/float(len(Y_train))
np.bincount(Y_test)/float(len(Y_test))
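We now normalize the training and testing sets. Note that normalize with axis = 1 rescales each row (sample) to unit L2 norm; it does not rescale each feature column.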
X_train = normalize(X_train, axis = 1)
X_test = normalize(X_test, axis = 1)
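As a rough illustration of what row-wise L2 normalization does, the sketch below rescales each sample so that its Euclidean length is 1. It is a plain-Python stand-in written for clarity, not the scikit-learn implementation, and the function name and toy matrix are assumptions.

```python
import math

def l2_normalize_rows(rows):
    """Scale each row to unit Euclidean length; all-zero rows are left unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm > 0 else list(row))
    return out

# Toy feature matrix: three samples, two features.
X = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]
X_unit = l2_normalize_rows(X)
```

Because each row is scaled independently, no statistics are shared between the training and testing sets, which is why the two calls above can be made separately.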
Logistic Regression Grid Search and Cross Validation¶
# The L1 penalty requires a solver that supports it, such as liblinear
log_model = LogisticRegression(penalty = 'l1', solver = 'liblinear')
params = {'C':[0.25, 0.5, 0.75, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]}
gs_model = GridSearchCV(log_model, params)
gs_model.fit(X_train, Y_train)
gs_model.best_params_
C_val = gs_model.best_params_['C']
best_log_model = LogisticRegression(C = C_val, penalty = 'l1', solver = 'liblinear')
best_log_model.fit(X_train, Y_train)
best_log_model.score(X_test, Y_test)
log_loss(Y_test, best_log_model.predict_proba(X_test))
cm = confusion_matrix(Y_test, best_log_model.predict(X_test))
plt.matshow(cm)
plt.colorbar()
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
print(cm)
le.classes_
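Accuracy only looks at the predicted label, while log loss also penalizes confident wrong probabilities. The sketch below computes multiclass log loss by hand on a few made-up predictions. The function name and the numbers are hypothetical, and the clipping constant is an assumption chosen to avoid taking the log of zero.

```python
import math

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to each true class."""
    total = 0.0
    for label, row in zip(y_true, proba):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# Hypothetical integer labels and predicted class probabilities.
y_true = [0, 2, 1]
proba = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.25, 0.5, 0.25]]
loss = multiclass_log_loss(y_true, proba)

# A model that always predicts a uniform distribution over k classes scores log(k).
uniform_loss = multiclass_log_loss([0, 1, 2, 3], [[0.25] * 4] * 4)
```

Lower is better, and the uniform-prediction value gives a simple reference point when reading the log loss numbers reported for each model.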
Gaussian Naive Bayes¶
nb_model = GaussianNB()
nb_model.fit(X_train, Y_train)
nb_model.score(X_test, Y_test)
Gradient Boosting¶
Grid Search for Gradient Boosting¶
# params = {'n_estimators': [200, 300, 450], 'learning_rate':[0.1, 0.5],
# 'max_depth':[5, 6, 7, 8], 'max_features':['sqrt', .40], 'min_samples_split':[2, 20, 30, 50, 100]}
params = {'n_estimators':[200, 300], 'max_features':['sqrt', 'log2'], 'max_depth':[5, 7]}
grid_search_gb = GridSearchCV(GradientBoostingClassifier(), params, verbose = 1)
grid_search_gb.fit(X_train, Y_train)
grid_search_gb.best_params_
num_estimators = grid_search_gb.best_params_['n_estimators']
depth = grid_search_gb.best_params_['max_depth']
features = grid_search_gb.best_params_['max_features']
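# Pass max_features as well so that every parameter selected by the grid search is used
best_gb_model = GradientBoostingClassifier(n_estimators = num_estimators, learning_rate = 0.1, max_depth = depth, max_features = features)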
best_gb_model.fit(X_train, Y_train)
best_gb_model.score(X_test, Y_test)
log_loss(Y_test, best_gb_model.predict_proba(X_test))
cm = confusion_matrix(Y_test, best_gb_model.predict(X_test))
plt.matshow(cm)
plt.colorbar()
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
print(cm)
le.classes_
feat_imp = pd.DataFrame({'features':dat.columns.values,'values':best_gb_model.feature_importances_})
top_feat = feat_imp.sort_values('values').iloc[-10:]
top_feat
plt.figure(figsize = (20,10))
plt.title('Top 10 Features')
plt.xticks(fontsize = 15)
sns.barplot(x = top_feat['features'], y = 100.0 * top_feat['values']/np.max(best_gb_model.feature_importances_))
XgBoost¶
params = {'n_estimators':[200, 300], 'max_depth':[4, 5, 6], 'subsample':[1, 0.8]}
gs_xgb = GridSearchCV(xgb.XGBClassifier(), params, verbose = 1)
gs_xgb.fit(X_train, Y_train)
gs_xgb.best_params_
depth = gs_xgb.best_params_['max_depth']
estimators = gs_xgb.best_params_['n_estimators']
sub_sample = gs_xgb.best_params_['subsample']
xgb_model = xgb.XGBClassifier(n_estimators = estimators, max_depth = depth, subsample = sub_sample,
objective = 'multi:softmax')
xgb_model.fit(X_train, Y_train, eval_metric = 'merror', eval_set = [(X_test, Y_test)],
early_stopping_rounds = 150,verbose = False)
xgb_model.score(X_test, Y_test)
log_loss(Y_test, xgb_model.predict_proba(X_test))
Majority Vote¶
mv_model = VotingClassifier([('nb',nb_model), ('gb',best_gb_model), ('log',best_log_model)], voting = 'soft')
mv_model.fit(X_train, Y_train)
mv_model.score(X_test, Y_test)
log_loss(Y_test, mv_model.predict_proba(X_test))
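With voting = 'soft', the ensemble averages the class probabilities predicted by each model and picks the class with the highest average, rather than counting one vote per model. The sketch below shows a case where the two rules disagree. The function name and the probabilities are made up for illustration, and equal model weights are assumed.

```python
def soft_vote(prob_lists):
    """Average class probabilities across models and pick the most probable class."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return avg.index(max(avg)), avg

# Hypothetical probabilities from three models for one animal and three classes.
model_probs = [
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.45, 0.35, 0.2],
]
winner, avg = soft_vote(model_probs)

# Hard voting would instead count each model's single most likely class.
hard_votes = [p.index(max(p)) for p in model_probs]
```

Here two of the three models narrowly prefer class 0, but the one confident prediction for class 1 is enough to win the soft vote.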
ZeroR Baseline¶
lab.value_counts().iloc[0]/float(len(labels))
Using the zero rule baseline, we would get a 40% accuracy by "guessing" all animals will be adopted. After testing numerous models, the best result is 64% prediction accuracy.
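The zero rule baseline is simple enough to compute by hand: find the most frequent label and report its share of the data. The sketch below does this on a toy label list. The function name is hypothetical, and the labels and their proportions are made up so that the majority class accounts for 40% of the rows.

```python
from collections import Counter

def zero_r_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / float(len(labels))

# Toy labels: 4 of the 10 rows belong to the majority class.
toy = ['Adoption'] * 4 + ['Transfer'] * 3 + ['Return_to_owner'] * 2 + ['Euthanasia']
label, acc = zero_r_accuracy(toy)
```

Any trained model should beat this number to be considered useful, which is the comparison made between the 40% baseline and the 64% best result.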